Host transcriptomics and machine learning for secondary bacterial infections in patients with COVID-19: a prospective, observational cohort study

Lancet Microbe. 2024 Mar;5(3):e272-e281. doi: 10.1016/S2666-5247(23)00363-4. Epub 2024 Feb 1.

Abstract

Background: Viral respiratory tract infections are frequently complicated by secondary bacterial infections. This study aimed to use machine learning to predict the risk of bacterial superinfection in SARS-CoV-2-positive individuals.

Methods: In this prospective, multicentre, observational cohort study done in nine centres in six countries (Australia, Indonesia, Singapore, Italy, Czechia, and France), blood samples and RNA sequencing were used to develop a robust model for predicting secondary bacterial infections in the respiratory tract of patients with COVID-19. Eligible participants were older than 18 years, had known or suspected COVID-19, and had symptoms of a recent respiratory infection. A control cohort of participants without COVID-19 who were older than 18 years and had no infection symptoms was also recruited from one Australian centre. In the pre-analysis phase, data were filtered to include only individuals with complete blood transcriptomics and patient data (ie, age, sex, location, and WHO severity score at the time of sample collection). The dataset was then divided randomly (4:1) into a training set (80%) and a test set (20%). Gene expression data in the training set and control cohort were used for differential expression analysis. Differentially expressed genes, along with WHO severity score, location, age, and sex, were used for feature selection with least absolute shrinkage and selection operator (LASSO) in the training set. For LASSO analysis, samples were excluded if gene expression data were not obtained at study admission, no longitudinal clinical information was available, a bacterial infection was present at the time of study admission, or a fungal infection was detected in the absence of a bacterial infection. LASSO regression was performed using three subsets of predictor variables: patient data alone, gene expression data alone, or a combination of patient data and gene expression data. The accuracy of the resultant models was tested on data from the test set.
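The random 4:1 split and the LASSO feature-selection step described above can be illustrated with a minimal sketch. It is not the study's pipeline: the abstract does not state the software or loss function used, and because the outcome is binary a logistic LASSO would be the more natural choice. The code below uses a squared-error LASSO fitted by cyclic coordinate descent for simplicity. All function names (`split_4_to_1`, `lasso_coordinate_descent`, `make_toy_data`) and the synthetic data are hypothetical.

```python
import random


def split_4_to_1(ids, seed=0):
    """Randomly assign about 80% of ids to a training set and the rest to a test set."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(0.8 * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def soft_threshold(z, penalty):
    """Shrink z towards zero by `penalty`; values within the penalty become exactly 0."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimise (1/2n) * ||y - Xw||^2 + penalty * ||w||_1 by cyclic coordinate descent.

    X is a list of rows. Assumption: columns and y are roughly centred,
    so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def make_toy_data(n=200, seed=0):
    """Synthetic data (not study data): only features 0 and 1 influence y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.1) for row in X]
    return X, y


if __name__ == "__main__":
    train_ids, test_ids = split_4_to_1(range(205), seed=1)
    print(len(train_ids), len(test_ids))
    X, y = make_toy_data()
    print(lasso_coordinate_descent(X, y, penalty=0.2))
```

In this toy setting the L1 penalty drives the coefficients of the three uninformative features to exactly zero while retaining shrunken coefficients for the two informative ones, which is the property that makes LASSO usable for feature selection.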

Findings: Between March, 2020, and October, 2021, we recruited 536 SARS-CoV-2-positive individuals and between June, 2013, and January, 2020, we recruited 74 participants into the control cohort. After prefiltering analysis and other exclusions, samples from 158 individuals were analysed in the training set and 47 in the test set. The expression of seven host genes (DAPP1, CST3, FGL2, GCH1, CIITA, UPP1, and RN7SL1) in the blood at the time of study admission was identified by LASSO as predictive of the risk of developing a secondary bacterial infection of the respiratory tract more than 24 h after study admission. Specifically, the expression of these genes in combination with a patient's WHO severity score at the time of study enrolment resulted in an area under the curve of 0·98 (95% CI 0·89-1·00), a true positive rate (sensitivity) of 1·00 (95% CI 1·00-1·00), and a true negative rate (specificity) of 0·94 (95% CI 0·89-1·00) in the test cohort. The combination of patient data and host transcriptomics at hospital admission identified all seven individuals in the training and test sets who developed a bacterial infection of the respiratory tract 5-9 days after hospital admission.
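For readers less familiar with the reported metrics, the sketch below shows how a true positive rate, a true negative rate, and an area under the curve can be computed from predicted scores. The labels and scores are made-up toy values, not study data. The function names and the 0.5 decision threshold are assumptions for illustration. The area under the curve is computed here as the proportion of positive-negative pairs ranked correctly, with ties counted as half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate) for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve as the fraction of positive-negative pairs
    in which the positive case has the higher score (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 1 = developed a secondary infection, 0 = did not
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec, auc(y_true, scores))
```

In the toy example both positive cases are flagged, so sensitivity is perfect, while one false positive lowers specificity. This mirrors the kind of trade-off summarised by the figures reported for the test cohort.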

Interpretation: These data raise the possibility that host transcriptomics at the time of clinical presentation, together with machine learning, can forward predict the risk of secondary bacterial infections and allow for the more targeted use of antibiotics in viral infection.

Funding: Snow Medical Research Foundation, the National Health and Medical Research Council, the Jack Ma Foundation, the Helmholtz Association, the A2 Milk Company, the National Institute of Allergy and Infectious Diseases, and the Fondazione AIRC Associazione Italiana per la Ricerca contro il Cancro.

Publication types

  • Observational Study
  • Multicenter Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Australia / epidemiology
  • Bacterial Infections*
  • COVID-19* / epidemiology
  • Cohort Studies
  • Fibrinogen
  • Gene Expression Profiling
  • Humans
  • Machine Learning
  • Prospective Studies
  • SARS-CoV-2 / genetics

Substances

  • FGL2 protein, human
  • Fibrinogen